set.seed(0911)
library(ggplot2)
library(gridExtra)
library(cowplot)
library(plotly) # interactive plots
library(ggfortify) # diagnostic plot
library(forestmodel) # plot odds ratios
library(arm) # binnedplot diagnostic plot in GLM
library(knitr)
library(dplyr)
library(tidyverse)
library(tidymodels)
library(broom) # function augment() adds columns to the original data that was modeled
library(effects) # plot effect of covariate/factor
library(questionr) # odd ratio
library(lmtest) # LRtest
library(survey) # Wald test
library(vcdExtra) # deviance test
library(rsample) # for data splitting
library(glmnet)
library(nnet) # multinom, glm
library(caret)
library(ROCR)
#library(PRROC) # alternative package for ROC and PR curves
library(ISLR) # dataset for statistical learning
ggplot2::theme_set(ggplot2::theme_light()) # set the graphical theme
The optimal prediction of \(Y\) conditionally on \(x\) is the regression function \(h(x)=\text{E}[Y\vert x]\). In the previous chapter, we assumed \(h(x)\) is linear with respect to \(x\): \[h(x)=\text{E}[Y\vert x]=x^T\beta, \quad\text{where}\quad Y=x^T\beta+\epsilon\text{, with }\epsilon\sim\mathcal{N}(0,\sigma^2).\]
😕 This linear assumption, however, limits us to continuous responses and does not handle categorical outcomes, thus restricting our ability to address classification tasks.
To broaden our approach, we introduce models that maintain a linear predictor \(\,\,\eta(x)=x^T\beta\) but incorporate a link function \(g\) to map the conditional expectation of \(Y\) to the linear predictor. Specifically, we assume \[ g(\text{E}_\beta[Y\vert x])=x^T\beta, \] where \(g(\cdot)=h^{-1}\) is known as the link function. Consequently, we can express the conditional expectation as \[ \text{E}[Y\vert x]=g^{-1}(\eta(x))=g^{-1}(x^T\beta). \] This formulation, which includes the linear predictor with a link function, allows for greater flexibility in modeling responses of various types, including binary and categorical outcomes.
[Definition 1]
A random variable \(Y\) is
said to have a probability density with respect to a dominant measure
\(\nu\), denoted by \(f_{\theta,\phi}\), and to belong to the
natural exponential family \(\mathcal{F}_\theta^{Nat}\) if \(f_{\theta,\phi}\) can be expressed as \[f_{\theta,\phi}(y)=\exp\left(\frac{y\theta-b(\theta)}{\phi}+c(y,\phi)\right),
\] where \(b(\cdot)\) and \(c(\cdot)\) are known and differentiable
functions, satisfying the following properties:
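To make Definition 1 concrete, here is the Bernoulli case \(\mathcal{B}(p)\) written in this natural exponential form:

```latex
f_{\theta,\phi}(y) = p^y(1-p)^{1-y}
                   = \exp\!\left( y\log\frac{p}{1-p} + \log(1-p) \right),
```

so that \(\theta=\log\frac{p}{1-p}\), \(b(\theta)=\log(1+e^{\theta})\), \(\phi=1\) and \(c(y,\phi)=0\).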
[Proposition 1]
If \(\,Y\) admits a density
belonging to the natural exponential family \(\mathcal{F}_\theta^{Nat}\) then
METHOD
1.\(\,\) Select the conditional distribution of \(Y\vert x\). Choose a probability distribution for \(Y\vert x\) from the natural exponential family.
2.\(\,\) Define the linear predictor. Set \(\eta(x):=x^T\beta\) and select an appropriate link function. Often, the canonical link function is used for simplicity and interpretability.
3.\(\,\) Estimate the parameter \(\beta\). Obtain an estimate \(\widehat{\beta}_n\) of the unknown parameter \(\beta\) using a sample \(\mathcal{D}_n=\{(Y_i,x_i)\}_{i=1,\cdots,n}\). Then, the estimated mean response is given by: \[g^{-1}(X\widehat{\beta}_n)\quad\text{where}\quad X=(x_1,\cdots,x_n)^T.\]
Let \(Y=(Y_1,\cdots,Y_n)^\top\) represent the response vector, and define the design matrix as: \[X = \begin{pmatrix} x_{11}&\cdots &x_{1p} \\ \vdots& &\vdots\\ x_{n1}&\cdots &x_{np} \end{pmatrix}= (X_1,\cdots,X_p)=\begin{pmatrix} x_{1}^T\\ \vdots\\ x_{n}^T \end{pmatrix},\] where each \(X_j\), \(j=1\cdots,p\) represents one of the explanatory variables.
Let \(\mathcal{L}(\beta)\) denote the log of the likelihood function. Given that the \(Y_i\) are independent, we have: \[\mathcal{L}(\beta)=\sum_{i=1}^n\log f_{\theta_i,\phi}(Y_i)=\sum_{i=1}^n\mathcal{L}_i(\beta), \] where \(\mathcal{L}_i(\beta)\) is the contribution of the \(i^\text{th}\) observation \((Y_i,x_i)\) to the log of the likelihood: \[\mathcal{L}_i(\beta)=\ell(Y_i,\theta_i,\phi,\beta)=\log f_{\theta_i,\phi}(Y_i)=\frac{Y_i\theta_i-b(\theta_i)}{\phi}+c(Y_i,\phi).\]
In practice, models frequently use the canonical link function for simplicity and interpretability.
[Definition 2]
Suppose \(Y\) is a random
variable with a density belonging to the natural exponential family
\(\mathcal{F}_\theta^{Nat}\), \(s.t.\) \[\text{E}_\theta[Y]=b'(\theta)=\mu,
\] then the function \(g(\mu)=(b')^{-1}(\mu)\) is defined as
the canonical link.
With the canonical link defined, it is possible to simplify the expressions in the likelihood equations. This leads to a more efficient formulation for estimating parameters, as shown in the following proposition.
[Proposition 3]
For the canonical link, the likelihood equations are
simplified: \[
\sum_{i=1}^n\frac{(Y_i-\mu_i)x_{i,j}}{\phi} =0,\quad j=1,\cdots,p.
\]
The table below summarizes some common distributions along with their respective canonical link functions and names. Here, we denote by \(\mu(x)=\text{E}[Y\vert x]=g^{-1}(\eta(x))=g^{-1}(x^T\beta).\)
| Law | \(\mathcal{B}(p)\)–\(\mathcal{B}\text{in}(N,p)\) | \(\mathcal{P}(\lambda)\) | \(\text{Neg}\mathcal{B}\text{in}(k,p)\) | \(\mathcal{N}(\mu,\sigma)\) | \(\Gamma(\alpha,\beta)\) |
|---|---|---|---|---|---|
| Canonical Link | \(\text{logit}(\mu)\) | \(\log(\mu)\) | \(\log_k(\mu)\) | \(\quad\mu\quad\) | \(\quad-\frac{1}{\mu}\quad\) |
| Link name | Logit | Log | \(\text{Log}_k\) | Identity | Reciprocal |
where \[
\text{logit}(\mu)=\log\left(\frac{\mu}{N-\mu}\right)\quad\text{and}\quad\log_k(\mu)=\log\left(\frac{\mu}{k+\mu}\right)
\]
📝 Remarks.
- When using a “logit link,” the model is referred to as logistic regression. Similarly, with a “logarithmic link,” we have Poisson regression.
- Other non-canonical link functions are also used in practice. For example, the probit link: \(g(\mu)=\Phi^{-1}(\mu)\) where \(\Phi(\cdot)\) is the standard Gaussian distribution function, and the log-log link: \(g(\mu)= \log(-\log(1-\mu))\) with \(\mu\in(0,1)\).
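Ⓡ These link functions are available directly in base R through family and link objects; a quick sketch (the values chosen here are purely illustrative):

```r
# Canonical logit link, from the binomial family object
logit <- binomial(link = "logit")
logit$linkfun(0.5)  # logit(0.5) = log(0.5/0.5) = 0
logit$linkinv(0)    # inverse logit of 0 = 0.5

# Non-canonical links mentioned above
probit  <- make.link("probit")   # g(mu) = Phi^{-1}(mu) = qnorm(mu)
cloglog <- make.link("cloglog")  # g(mu) = log(-log(1 - mu))
probit$linkfun(0.975)            # qnorm(0.975), approximately 1.96
```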
[Theorem 1]
Under certain regularity conditions, the maximum likelihood estimator
\(\widehat{\beta}_n^{ML}\) defined by
\[\widehat{\beta}_n^{ML}:=\underset{\beta}{\arg\max}\,\sum_{i=1}^n\frac{Y_ix_i^T\beta-b(x_i^T\beta)}{\phi}\]
has the following asymptotic properties:
Moreover, under \(\mathbb P_{\beta_0}\), \[I^{1/2}(\widehat{\beta}_n^{ML})\sqrt{n}( \widehat{\beta}_n^{ML}-\beta_0)\overset{\mathcal{D}}{\longrightarrow}\mathcal{N}(\boldsymbol{0}_p,\textbf{I}_p). \]
Ⓡ To illustrate the
differences between logistic and linear regression, consider a binary
response variable \(Y\), where \(Y\) takes values in \(\{0,1\}\). In this context, \(Y\) represents whether a payment
default occurs.
Linear Regression Model We begin with a simple linear regression model to predict the probability of default.
Ⓡ The following
code visualizes a linear regression fit for predicting \(Y\) as a function of
balance:
p1 <- ISLR::Default %>%
mutate(prob = ifelse(default == "Yes", 1, 0)) %>%
ggplot(aes(balance, prob)) +
geom_point(alpha = .15) +
geom_smooth(method = "lm") +
ggtitle("Linear regression model fit") +
xlab("Balance") +
ylab("Probability of Default")
ggplotly(p1)
📈 The plot suggests that the linear model may not be suitable for this binary outcome, as it can predict probabilities outside the \([0,1]\) range. Therefore, selecting an appropriate distribution for \(Y\vert x\) is crucial; a Bernoulli distribution for \(Y\) conditioned on \(x\) is more appropriate in this case, with: \[ p(x)=P(Y=1\vert x)\text{ and } \mu(x)=\text{E}[Y\vert x]=p(x). \]
Logistic Regression Model To correctly model the relationship, we choose the canonical logit link function: \[g(\mu(x))=g(p(x))=\text{logit}(p(x))=\log\left(\frac{p(x)}{1-p(x)}\right).\] With \(\eta(x)=x^T\beta\) and \(\widehat{\beta}_n\) as an estimator of \(\beta\) from \(n\) observations, we estimate \(\text{E}[Y\vert x]=p(x)\) using: \[\widehat{p}(x)=g^{-1}(\widehat{\eta}(x))=g^{-1}(x^T\widehat{\beta}_n)=\frac{e^{x^T\widehat{\beta}_n}}{1+e^{x^T\widehat{\beta}_n}}. \]
Ⓡ The code below compares the logistic model fit:
p2 <- ISLR::Default %>%
mutate(prob = ifelse(default == "Yes", 1, 0)) %>%
ggplot(aes(balance, prob)) +
geom_point(alpha = .15) +
geom_smooth(method = "glm", method.args = list(family = "binomial")) +
ggtitle("Logistic regression model fit") +
xlab("Balance") +
ylab("Probability of Default")
subplot(ggplotly(p1), ggplotly(p2), nrows = 1)
📈 A typical classification rule is to predict \(\widehat Y_i=1\) if \(\widehat{p}_i=\widehat{p}(x_i)>s\), with \(s=0.5\) commonly used as the threshold.
To begin with, let us recall that for the canonical link, the likelihood equations simplify to: \[\sum_{i=1}^n\frac{(Y_i-\mu_i)x_{i,j}}{\phi} =0,\quad j=1,\cdots,p. \] Next, we consider a logistic regression model where \[Y_i\vert x_i \sim\mathcal{B}(p_i)\] with \(\mu_i=p_i=\frac{e^{x_i^T\beta}}{1+e^{x_i^T\beta}}\) and \(\phi=1\). Consequently, the likelihood equations can be expressed as: \[\sum_{i=1}^n\left(Y_i-\frac{e^{x_i^T\beta}}{1+e^{x_i^T\beta}}\right)x_{i,j} =0,\quad\forall j=1,\cdots,p.\]
📝 Remarks.
- It is important to note that there is generally no closed-form solution for these equations.
- Instead, efficient approximation algorithms, such as the Newton-Raphson algorithm, are commonly employed to find estimates.
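To illustrate, here is a minimal iteratively reweighted least squares (IRLS) sketch of Newton-Raphson for the logistic likelihood equations; the helper name `irls_logit` and the simulated data are our own, and `glm()` remains the reference implementation:

```r
# Minimal IRLS (Newton-Raphson) for logistic regression -- illustration only
irls_logit <- function(X, y, tol = 1e-8, max_iter = 25) {
  beta <- rep(0, ncol(X))
  for (iter in seq_len(max_iter)) {
    eta <- as.vector(X %*% beta)
    p   <- plogis(eta)            # inverse logit
    W   <- p * (1 - p)            # working weights (variance function)
    z   <- eta + (y - p) / W      # working response
    beta_new <- as.vector(solve(t(X) %*% (W * X), t(X) %*% (W * z)))
    if (max(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  beta
}

# Check against glm() on simulated data
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-1 + 2 * x))
X <- cbind(1, x)
cbind(irls = irls_logit(X, y),
      glm  = coef(glm(y ~ x, family = "binomial")))
```

The two columns should agree to numerical precision, since `glm()` uses the same Fisher scoring iterations.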
For this chapter, we will use the binary.csv dataset,
which contains information regarding the admissions of students to a new
establishment.
The dataset includes:

- the response variable (`admit`), which is a factor with two levels: `0` (not admitted) and `1` (admitted);
- two continuous covariates: the Graduate Record Exam score (`gre`) and the Grade Point Average (`gpa`);
- a factor `rank`, which has four levels (`1`, `2`, `3` and `4`). Note that a rank `1` establishment is more prestigious than a rank `2` establishment.

Ⓡ To load and view the dataset, use the following code:
mydata<-read.csv("binary.csv",header=T,sep=",")
mydata$admit<-factor(mydata$admit)
mydata$rank<-factor(mydata$rank)
mydata %>% rmarkdown::paged_table()
Ⓡ You can also obtain a summary of the dataset with:
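The summary shown below can presumably be reproduced with:

```r
summary(mydata)
```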
## admit gre gpa rank
## 0:273 Min. :220.0 Min. :2.260 1: 61
## 1:127 1st Qu.:520.0 1st Qu.:3.130 2:151
## Median :580.0 Median :3.395 3:121
## Mean :587.7 Mean :3.390 4: 67
## 3rd Qu.:660.0 3rd Qu.:3.670
## Max. :800.0 Max. :4.000
To avoid any confusion regarding the levels of the response variable, let us rename them:
levels(mydata$admit)[levels(mydata$admit) %in% c('0', '1')] <-c('No', 'Yes')
table(mydata$admit) %>% as.data.frame() %>% setNames(c("Admit", "Counts")) %>% kableExtra::kbl()
| Admit | Counts |
|---|---|
| No | 273 |
| Yes | 127 |
For the sake of simplicity, we can define \(Y=1\) for admitted students (Yes) and \(Y=0\) otherwise (No).
📝 Remarks.
- The order of the levels in the response variable is significant.
- In machine learning terminology, the class `Y=1` is referred to as the positive class, while the class `Y=0` is termed the negative class.
- Typically, the reference level (the first level in the table) is assigned to the majority class, which is considered the negative class (here, `0=No`). If necessary, you can change the reference level with the `relevel()` function.
Split the dataset (Train and Test)
To evaluate the performance of our logistic regression
model, we need to split the dataset into training and testing sets. This
allows us to train the model on one subset of the data and validate its
performance on another, unseen subset. To achieve a more
representative split, particularly in terms of the outcome
variable (admission), we will use stratified sampling. This
ensures that the proportion of each class in the response variable is
maintained in both the training and testing sets.
Ⓡ The code is the following:
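The author's exact splitting code is not reproduced here; a sketch with `rsample` (the 70/30 proportion and the seed are assumptions, the proportion chosen to match the training-set size appearing below):

```r
set.seed(123)  # assumed seed; the original is not shown
data_split <- initial_split(mydata, prop = 0.7, strata = admit)
train <- training(data_split)
test  <- testing(data_split)
```

The `strata = admit` argument performs the stratified sampling described above, preserving the No/Yes proportions in both subsets.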
Let us define the mathematical framework for our analysis of student admissions:

- a factor `rank`, which can take on \(J=4\) levels (`1`, `2`, `3` and `4`). Notably, an establishment with rank `1` is considered more prestigious than one with rank `2`;
- a binary response variable \(Y\) (`admit`) with two modalities. Specifically, \(Y_{ij}=1\) corresponds to the individual \(i\) from rank \(j\) being admitted (i.e., \(Y_{ij}=1=\)Yes), and \(Y_{ij}=0\) signifies no admission (i.e., \(Y_{ij}=0=\)No);
- two continuous covariates, with \(X^{(1)}\) denoting the Graduate Record Exam score (`gre`) and \(X^{(2)}\) denoting the Grade Point Average (`gpa`).

We can express the model in the following form: \[x_{ij}^T\beta=\mu+\alpha_j+\beta_1 X_{ij}^{(1)}+\beta_2 X_{ij}^{(2)},\quad\text{where}\quad Y_{ij}\vert x_{ij}\sim\mathcal{B}(p(x_{ij})) \] with the expected value given by \[\text{E}[Y_{ij}\vert x_{ij}]=p(x_{ij})\quad\text{and}\quad p(x_{ij})=\frac{e^{x_{ij}^T\beta}}{1+e^{x_{ij}^T\beta}}.\] It is important to note that \(p(x_{ij})=\mathbb P(Y=1)= \mathbb P(Y=\text{Yes})\).
Simple models with the glm function
We will define three simple logistic regression models to explore the effects of the covariates on admission:
Model 1. This model considers only the covariate gre: \[\text{E}[Y_{i}\vert x_{i}]=p(x_{i})\quad\text{where}\quad x_{i}^T\beta=\mu+\beta_1 X_{i}^{(1)}.\]
Model 2. This model incorporates only the covariate gpa: \[\text{E}[Y_{i}\vert x_{i}]=p(x_{i})\quad\text{where}\quad x_{i}^T\beta=\mu+\beta_2 X_{i}^{(2)}.\]
Model 3. In this model, we examine the effect of the factor rank: \[\text{E}[Y_{ij}\vert x_{ij}]=p(x_{ij})\quad\text{where}\quad x_{ij}^T\beta=\mu+\alpha_j.\]
Ⓡ The models are implemented using the following R code:
model1 <- glm(admit ~ gre, family = "binomial", data = train)
model2 <- glm(admit ~ gpa, family = "binomial", data = train)
model3 <- glm(admit ~ rank, family = "binomial", data = train)
Ⓡ We can summarize the results of each model and present the coefficients using the following code:
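One way to produce the coefficient tables shown below (a sketch; not necessarily the author's exact code):

```r
summary(model1)$coefficients %>% kable()
summary(model2)$coefficients %>% kable()
summary(model3)$coefficients %>% kable()
```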
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -2.6751786 | 0.7217154 | -3.706695 | 0.0002100 |
| gre | 0.0031948 | 0.0011790 | 2.709660 | 0.0067352 |
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -4.582399 | 1.2572904 | -3.644663 | 0.0002677 |
| gpa | 1.110625 | 0.3617611 | 3.070052 | 0.0021402 |
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | 0.3022809 | 0.3198465 | 0.945081 | 0.3446175 |
| rank2 | -0.7783635 | 0.3766212 | -2.066701 | 0.0387623 |
| rank3 | -1.7457336 | 0.4183470 | -4.172932 | 0.0000301 |
| rank4 | -1.9398897 | 0.5224335 | -3.713180 | 0.0002047 |
Ⓡ To visualize the impact of the covariates on the predicted probabilities of admission, we create plots for each model.
train2 <- train %>% mutate(prob = ifelse(admit == "Yes", 1, 0))
p1 <-train2 %>% ggplot( aes(gre, prob)) +
geom_point(alpha = 0.15) +
geom_smooth(method = "glm", method.args = list(family = "binomial")) +
xlab("gre") +ggtitle("Model 1")+ ylab("P(Y=1)")
p2 <-train2%>% ggplot( aes(gpa, prob)) +
geom_point(alpha = 0.15) +
geom_smooth(method = "glm", method.args = list(family = "binomial")) +ggtitle("Model 2") +
xlab("gpa") +
ylab("P(Y=1)")
train3 <- broom::augment(model3, train2) %>% mutate(.fitted = plogis(.fitted)) # back-transform from the link scale to probabilities
p3 <-train3 %>% ggplot( aes(rank, .fitted, color = rank)) +geom_boxplot() +
geom_rug(sides = "b", position = "jitter", alpha = 0.2, show.legend = FALSE) +
ggtitle("Model 3") +
xlab("rank") +
ylab("P(Y=1)")
subplot(ggplotly(p1), ggplotly(p2), ggplotly(p3), nrows = 1)
✍ Key code \(\,\)
- dplyr::mutate(). This function is a powerful tool from the dplyr package that allows for the creation of new columns derived from existing variables within a dataset. In this context, it is employed to create a new column that encodes the probability of admission, derived from the binary response variable.
- broom::augment(). The augment() function from the broom package enhances the original dataset used for modeling. It appends additional columns, including predicted values, residuals, and other diagnostic metrics. This makes visualization and evaluation of the model's performance more intuitive, and it is easier to assess how well the logistic regression model fits the observed data.

Now, let's extend our analysis to a complete logistic regression model, represented as follows: \[ Y_{ij}\vert x_{ij}\sim\mathcal{B}(p(x_{ij}))\quad\text{where}\quad x_{ij}^T\beta=\mu+\alpha_j+\beta_1 X_{ij}^{(1)}+\beta_2 X_{ij}^{(2)}\quad\text{and}\quad p(x_{ij})=\frac{e^{x_{ij}^T\beta}}{1+e^{x_{ij}^T\beta}}.\]
Model Declaration. There are several approaches to declare a logistic regression model in R, and we will explore a few of them here.
Using the multinom() Function from the nnet Package. This method is particularly useful for multinomial logistic regression, allowing for the modeling of response variables with more than two categories.
Ⓡ
model_multi <- multinom(admit ~ ., data = train)
Using the glm() Function. The Generalized Linear Model (glm) function provides a flexible framework for modeling various types of data, including binomial outcomes. Here, we specify the response variable admit in relation to the covariates rank, gpa, and gre.
Ⓡ
model_glm <- glm(admit ~ rank + gpa + gre, family = "binomial", data = train)
Using the train() Function from the caret Package. The caret package is invaluable for building predictive models in Ⓡ. By utilizing the train() function, we can efficiently create a logistic regression model while applying cross-validation to enhance model reliability.
Ⓡ
model_caret <- train(admit ~ ., data = train, method = "glm",
                     family = "binomial", trControl = trainControl(method = "cv", number = 10))
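Ⓡ The summary output below corresponds to the glm fit and can presumably be reproduced with:

```r
summary(model_glm)
```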
##
## Call:
## glm(formula = admit ~ rank + gpa + gre, family = "binomial",
## data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.847985 1.396641 -2.755 0.005866 **
## rank2 -0.699451 0.385413 -1.815 0.069554 .
## rank3 -1.707940 0.426848 -4.001 6.3e-05 ***
## rank4 -1.810166 0.531629 -3.405 0.000662 ***
## gpa 0.883593 0.409500 2.158 0.030948 *
## gre 0.001780 0.001342 1.326 0.184769
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 347.84 on 278 degrees of freedom
## Residual deviance: 311.70 on 273 degrees of freedom
## AIC: 323.7
##
## Number of Fisher Scoring iterations: 4
1. Number of Fisher Scoring iterations The summary output indicates the number of iterations required for convergence. Fisher’s scoring algorithm is a variant of Newton’s method specifically designed for solving maximum likelihood problems numerically.
Ⓡ In this model, it took four iterations to achieve convergence.
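The iteration count is stored in the fitted object:

```r
model_glm$iter
```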
## [1] 4
2. Coefficients We assess the significance of each coefficient individually using a Gaussian test. As established in Theorem 1, the following asymptotic relationship holds: \[ I^{1/2}(\widehat{\beta}_n^{ML})\sqrt{n}( \widehat{\beta}_n^{ML}-\beta_0)\overset{\mathcal{D}}{\longrightarrow}\mathcal{N}(\boldsymbol{0}_p,\textbf{I}_p). \]
Ⓡ The coefficients and their associated statistics can be extracted as follows:
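For instance:

```r
summary(model_glm)$coefficients
```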
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.847984992 1.396641426 -2.755170 5.866159e-03
## rank2 -0.699451040 0.385413495 -1.814807 6.955360e-02
## rank3 -1.707940071 0.426847638 -4.001287 6.299879e-05
## rank4 -1.810165945 0.531628860 -3.404943 6.617787e-04
## gpa 0.883593421 0.409500429 2.157735 3.094844e-02
## gre 0.001779811 0.001342024 1.326214 1.847689e-01
Moreover, we can perform the Wald test to further evaluate the null hypothesis regarding the coefficients.
[Proposition 4]
Consider the hypothesis test: \[
\mathcal{H}_0 \ : \ \beta_j=0 \quad\text{vs.}\quad
\mathcal{H}_1 \ : \ \beta_j\neq 0.
\] Under certain assumptions and under \(\mathcal{H}_0\) \[
S:=n\left[
I(\widehat{\beta}^{ML})\right]_{jj}\left(\widehat{\beta}^{ML}_j\right)^2\overset{\mathcal{D}}{\longrightarrow}\chi_{1}^2.
\] For a fixed significance level \(\alpha \in (0, 1)\), the rejection region
is defined as: \[\left\{S\geq
q_{1-\alpha}^{\chi^2_{1}}\right\},\] where \(q_{1-\alpha}^{ \chi^2_{1}}\) is the
quantile of order \(1-\alpha\) from the
chi-squared distribution with 1 degree of freedom.
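Ⓡ The Wald-test outputs below are in the format produced by `survey::regTermTest()`; a sketch:

```r
regTermTest(model_glm, "gre")
regTermTest(model_glm, "gpa")
```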
## Wald test for gre
## in glm(formula = admit ~ rank + gpa + gre, family = "binomial",
## data = train)
## F = 1.758843 on 1 and 273 df: p= 0.18588
## Wald test for gpa
## in glm(formula = admit ~ rank + gpa + gre, family = "binomial",
## data = train)
## F = 4.65582 on 1 and 273 df: p= 0.031821
The Wald test can also be applied to categorical factors.
[Proposition 5]
Consider the hypothesis test: \[
\begin{cases}
\mathcal{H}_0 \ : \ \alpha_{(-1)}=(\alpha_2,\cdots,\alpha_J)^T=\boldsymbol{0}_{J-1},\\
\mathcal{H}_1 \ : \ \alpha_{(-1)}\neq \boldsymbol{0}_{J-1}.
\end{cases}
\] Under certain assumptions and under \(\mathcal{H}_0\) \[S:=\left\|\sqrt{n}\,\text{I}^{1/2}\left(\widehat{\beta}_{(-1)}^{ML}\right)\widehat{\alpha}^{ML}_{(-1)}\right\|^2\overset{\mathcal{D}}{\longrightarrow}\chi_{J-1}^2.
\]
\]
For a fixed significance level \(\alpha \in (0, 1)\), the rejection region is: \[\left\{S\geq q_{1-\alpha}^{\chi^2_{J-1}}\right\},\] where \(q_{1-\alpha}^{ \chi^2_{J-1}}\) is the quantile of order \(1-\alpha\) from the chi-squared distribution with \(J - 1\) degrees of freedom.
📝 Remark.
Note that for categorical variables, the Wald test has a different formulation due to the additional constraint (here \(\alpha_1 = 0\)).
Ⓡ Display the Wald test for the categorical variable rank.
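As above, the output below matches the format of `survey::regTermTest()`; a sketch:

```r
regTermTest(model_glm, "rank")
```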
## Wald test for rank
## in glm(formula = admit ~ rank + gpa + gre, family = "binomial",
## data = train)
## F = 7.264522 on 3 and 273 df: p= 0.000105
3. Deviance To define deviance, we first introduce the concept of the saturated model: the saturated model \([m_{sat}]\) is the model with as many parameters as observations, so that each observation is fitted exactly.
Therefore, we compare \(\mathcal{L}\) the log of the likelihood of our model, with \(\mathcal{L}_{[m]}\), the log of the likelihood of the saturated model \([m_{sat}]\).
Example.
If \(Y_i\vert x_i\sim\mathcal{B}(p(x_i))\), then for the saturated model \([m_{sat}]\), we have: \[\widehat{\text{E}[Y_i\vert x_i]}=\widehat{p}(x_i)=y_i.\] In this case, the log of the likelihood is zero \[\mathcal{L}_{[m_{sat}]}=\sum_{i=1}^n \log\left(\widehat{p}(x_i)^{y_i} (1-\widehat{p}(x_i))^{1-y_i}\right) =0 \]
If \(Y_i\vert
x_i\sim\mathcal{B}(n,p(x_i))\), then for the saturated model
\([m_{sat}]\), we have: \[\widehat{\text{E}[Y_i\vert
x_i]}=n\widehat{p}(x_i)=y_i.\]
Here, the log of the likelihood is not zero \[\mathcal{L}_{[m_{sat}]}=\sum_{i=1}^n
\log\left(\binom{n}{y_i}\,\widehat{p}(x_i)^{y_i}
(1-\widehat{p}(x_i))^{n-y_i}\right) \neq 0 .
\]
[Definition 3]
The deviance of a model \([m]\), defined with respect to the saturated
model \([m_{sat}]\) is denoted \(\mathcal{D}_{[m]}\) and is equal to \[\mathcal{D}_{[m]}=2\left(\mathcal{L}_{[m_{sat}]}-\mathcal{L}_{[m]}\right)\geq0,\]
where \(\mathcal{L}_{[m_{sat}]}\) and
\(\mathcal{L}_{[m]}\) represent the log
likelihoods in the saturated model and in the model \([m]\), respectively.
📝 Remark.
It is evident that a higher deviance \(\mathcal{D}_{[m]}\) indicates a poorer fit for the model \([m]\).
Ⓡ The residual deviance corresponds to the value of the deviance of our model \([\texttt{mod}]\) \[\mathcal{D}_{[\texttt{mod}]}(\widehat{p})=2(\mathcal{L}_{[m_{sat}]}(\widehat{p})-\mathcal{L}_{[\texttt{mod}]}(\widehat{p}))=-2\mathcal{L}_{[\texttt{mod}]}(\widehat{p})\]
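It can be extracted with:

```r
deviance(model_glm)
```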
## [1] 311.7008
Ⓡ The null deviance represents the value of the deviance of the null model (which includes only the intercept): \[ \mathcal{D}_{[\texttt{nul}]}(\widehat{p})=-2\mathcal{L}_{[\texttt{nul}]}(\widehat{p})=-2\mathcal{L}_{[\texttt{nul}]}(\overline{y}) \]
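It can be extracted with:

```r
model_glm$null.deviance
```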
## [1] 347.8364
📈 The deviance of our model is significantly lower than the null deviance (the deviance of the model reduced to the intercept). This indicates that our model provides a better fit than the null model.
In a linear regression setting, residuals are defined as the difference between the observed values \(y_i\) and the predicted values \(\widehat{\text{E}[Y_i\vert x_i]}=\widehat y_i\). However, in a more general context, it is possible that \(\widehat{\text{E}[Y_i\vert x_i]}\neq \widehat y_i\).
😕 In logistic regression, the outcome data is discrete, and consequently, the residuals are also discrete. Therefore, plots of raw residuals from logistic regression are generally not very informative.
1. Response residuals In logistic regression, the residuals are defined as the difference between the observed values \(y_i\) and the predicted probabilities \(\widehat{p}(x_{i})=g^{-1}(x_{i}^T\widehat\beta)\): \[ res_{response_i}:=\widehat\epsilon_{i}=y_{i}-\widehat{p}_{i}. \]
Ⓡ To display the response residuals, you can use the following command:
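A sketch of the command:

```r
model_glm %>% residuals(type = "response") %>% summary()
```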
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.7175 -0.2999 -0.1586 0.0000 0.4522 0.9018
2. Pearson residuals The Pearson residuals are obtained by normalizing the response residuals \(res_{response_i}\) using the estimated variance \(\widehat{\text{Var}(Y_{i})}\), the estimated variance of \(Y_{i}\). In a logistic context, this variance is given by: \[\widehat{\text{Var}(Y_{i})}=\widehat{p}(x_{i})( 1-\widehat{p}(x_{i})).\] The Pearson residuals are then defined as: \[ res_{pearson_i} =\frac{res_{response_i}}{\sqrt{\widehat{p}(x_{i})( 1-\widehat{p}(x_{i}))}} \]
Ⓡ To display Pearson residuals, use the following command:
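A sketch of the command:

```r
model_glm %>% residuals(type = "pearson") %>% summary()
```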
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.593811 -0.654480 -0.434174 -0.001998 0.908498 3.030883
3. Standardized Pearson residuals The Standardized Pearson residuals \(res_{Stand.pearson_i}\) are calculated by normalizing the Pearson residuals \(res_{pearson_i}\) with the leverage effect: \[ res_{Stand.pearson_i} =\frac{res_{pearson_i}}{\sqrt{1-h_{ii}}}, \] where \(h_{ii}\) is the \(i^{th}\) diagonal element of the projection matrix \(H=X(X^TX)^{-1}X^T\) in the full rank setting of the matrix \(X\).
Ⓡ To display these residuals, use:
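A sketch, using `rstandard()`, which divides by the leverage factor:

```r
rstandard(model_glm, type = "pearson") %>% summary()
```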
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.618585 -0.661913 -0.437873 -0.001867 0.921366 3.056838
4. Deviance residuals
The summary of model_glm includes the deviance
residuals, which are suitable for generalized models. These residuals
measure how far \(\mathcal{L}_{[m]}(\widehat\beta,y)\) for
the \(i\)-th observation is
from the saturated \(\mathcal{L}_{[m_{sat}]}(\widehat\beta_{sat},y)\).
To define them:
The deviance is calculated as: \[\mathcal{D}_{[m]}=2\left( \mathcal{L}_{[m_{sat}]}(\widehat\beta_{sat},y)-\mathcal{L}_{[m]}(\widehat\beta,y) \right) \] where \(\widehat\beta\) and \(\widehat\beta_{sat}\) are the maximum likelihood estimators obtained from the models \([m]\) and \([m_ {sat}]\), respectively.
For any model \([m]\), the log of the likelihood \(\mathcal{L}_{[m]}\) can be expressed as a sum of contributions from each data point: \[ \mathcal{L}_{[m]}(\beta,y)=\sum_{i=1}^n \log\left( p(x_i)^{y_i} (1- p(x_i))^{1-y_i}\right):=\sum_{i=1}^n \mathcal{L}_{[m]}(\beta,y)|_i \]
Thus, the deviance can also be expressed as a sum of contributions from each observation: \[ \mathcal{D}_{[m]}:=\sum_{i=1}^n d_i \] where \(d_i\) represents the deviance due to observation \(i\). Using this formulation, the deviance residual for observation \(i\) is defined as: \[ res_{dev_i} ={\rm sign}(y_i-\widehat{p}_i)\sqrt{ d_i}. \] Ⓡ To display the deviance residuals, use the following command:
#by default : model_glm%>% residuals() %>% summary()
model_glm %>% residuals(type = "deviance") %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.5901 -0.8444 -0.5877 -0.1047 1.0971 2.1545
5. Standardized deviance residual
Standardized deviance residuals measure how far \(\mathcal{L}_{[m]}(\widehat\beta,y)\) for the \(i\)-th observation is from \(\mathcal{L}_{[m_{sat}]}(\widehat\beta_{sat},y)\) for the same observation, but they are normalized through the leverage effect: \[ res_{stand\_dev_i} =\frac{res_{dev_i}}{\sqrt{1-h_{ii}}}. \]
Ⓡ To compute these residuals, use:
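A sketch of the command:

```r
rstandard(model_glm, type = "deviance") %>% summary()
```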
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.6148 -0.8537 -0.5927 -0.1055 1.1121 2.1730
📝 Remark.
If the model is “good,” we can show that the residuals are asymptotically Gaussian (this can be verified using a Q-Q plot).
[Which Residuals Should You Focus On?]
As is often the case in statistics, the answer depends on the context!
1. Residuals vs. Fitted plot
The Residuals vs. Fitted plot is typically used to check for trends in the residuals. This plot utilizes Pearson residuals.
😕 In logistic regression, the discrete nature of the residuals means that plots of raw residuals are generally not informative.
2. Scale-Location plot The Scale-Location plot is commonly employed to detect heteroskedasticity. Standardized Pearson residuals \(res_{Stand.pearson_i}\) are used here.
😕 Due to the nature of the response variable \(Y\), the residuals are often heteroskedastic, rendering this plot less useful.
3. Normal Q-Q plot In a linear setting, the Q-Q plot is used to assess the normality of the residuals.
😕 Although standardized Pearson residuals have an approximate standard normal distribution (as noted in Agresti, 2002), interpreting the plot can be challenging. Therefore, this Normal Q-Q plot is not particularly useful.
1. Normal Q-Q plot Ⓡ We can generate a Normal Q-Q plot based on either Standardized Pearson residuals or Standardized Deviance residuals.
residual_glm<-rstandard(model_glm,type = 'deviance')
#residual_glm<-rstandard(model_glm,type = 'pearson')
ggplot(train,aes(sample = residual_glm,col=admit)) +geom_qq(size=1)+geom_qq_line(col='black')+ facet_wrap(~admit)+
ylab("Standardized deviance residuals Quantiles") + xlab("Theoretical Quantiles") +
ggtitle("Normal Q-Q plot")
📝 Remarks.
- The function
geom_qq(distribution = stats::qnorm)defaults to a normal distribution.- The
geom_qq_line()function adds a line to the theoretical Q-Q plot that passes through the first and third quartiles.
📈 The quantiles of our sampled data closely align with the theoretical quantiles, indicating that our residuals are approximately normally distributed.
2. An alternative to Residuals vs Fitted plot It is essential to ensure that there is no underlying structure or trend in the residuals. If a trend is detected, it may be necessary to investigate the cause, such as a poor model fit or a nonlinear relationship.
The Binned Residuals Plot serves as an alternative to the Residuals vs. Fitted plot. It divides the data into categories (bins) based on their fitted values, plotting the average residual against the average fitted value for each bin.
It plots the average fitted values (Expected values) versus the average response residuals (Average residual) for each bin.
Expected residuals. The difference between the proportion of Yes responses observed among the points in a bin and the average predicted probability for those points: \[ res_{Expected_j}:=\frac{\sum_{bin_j}y_i}{\#\,bin_j}-\frac{\sum_{bin_j}\widehat{p}(x_i)}{\#\,bin_j} \] Example. For a group of 20 points with 11 Yes responses and an average prediction of 0.6, the expected residual would be \(11/20 - 0.6=-0.05\).
Average residuals. \[ res_{Average_j}:=\frac{\sum_{bin_j}(y_i-\widehat{p}(x_i))}{\#\,bin_j}=\frac{\sum_{i\in bin_j}res_{response_i}}{\#\,bin_j} \]
Ⓡ The
binnedplot function from the arm package is
used to create this binned residual plot.
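A sketch of the call (the axis labels shown are `binnedplot`'s defaults, written out explicitly here):

```r
binnedplot(fitted(model_glm),
           residuals(model_glm, type = "response"),
           xlab = "Expected Values", ylab = "Average residual",
           main = "Binned residual plot")
```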
📈 The graph produced by
binnedplot also displays a 95% prediction interval
(indicated by blue lines) for the mean residuals. If the binomial model
is adequate, approximately 95% of the residuals should fall within this
interval, which is indeed the case here.
📝 Remarks.
- By default,
binnedplotselects the number of groups based on a balance between having enough points per bin (to ensure accurate average proportions) and enough bins (to observe trends if they exist). For \(n>100\), the number of groups chosen is approximately \(\sqrt n\).- The
binnedplotfunction can also be used to investigate residuals in relation to individual predictors (see Gelman, page 98, for an example).
In logistic regression, the coefficients can be interpreted as Odds Ratios (OR), which provide insight into the relationship between predictor variables and the probability of a particular outcome.
[Definition 4]
The odds for an individual \(x\) to obtain the answer \(Y=1\) are defined by
\[odds(x)=\frac{p(x)}{1-p(x)},\quad\text{where}\quad
p(x)=P(Y=1\vert x).\]
📝 Remark.
Odds and probabilities represent the same underlying information but in different forms: \[ odds=\frac{p}{1−p}\quad \text{and}\quad p=\frac{odds}{1+odds}. \]
Example. An odds of 1 is equivalent to a probability of \(p=0.5\).
The ratio of two odds is termed the Odds Ratio (OR) and is defined as follows.
[Definition 5]
The Odds ratio (OR) between 2 individuals
\(x\) and \(x'\) is defined as: \[OR(x,x')=\frac{odds(x)}{odds(x')}=\frac{p(x)(1-p(x'))}{p(x')(1-p(x))}.\]
Use of OR. Odds Ratios allow for the comparison of probabilities of “success” between individuals \(x\) and \(x'\).
Other interpretation: change in probability. An Odds Ratio can also be interpreted in terms of a change in probability.
Example. A change in probability from \(p_1=1/3\approx 0.33\) to \(p_2=0.5\) results in an OR of 2: \[ \frac{\frac{0.5}{0.5}}{\frac{1/3}{2/3}}=\frac{1}{0.5}=2. \] This indicates that the odds of success have doubled.
Other interpretation: percentage Increase in Odds. We can also interpret the OR as reflecting the percentage increase in the odds of an event. Specifically, an OR of 2 implies that the odds have increased by 100%.
Example. For \(p_1 = 0.33\) and \(p_2 = 0.5\), the initial and final odds are: \[ odds_1 = \frac{0.33}{1 - 0.33} = \frac{0.33}{0.67} \approx 0.49\quad\text{and}\quad odds_2 = \frac{0.5}{1 - 0.5} = \frac{0.5}{0.5} = 1. \]
The increase in odds from approximately 0.49 to 1 represents a change of: \[ \text{Percentage Increase} = \left( \frac{odds_2 - odds_1}{odds_1} \right) \times 100\% = \left( \frac{1 - 0.49}{0.49} \right) \times 100\% \approx 104.08\%. \]
This shows that moving from an odds of approximately 0.49 to 1 corresponds to an increase of over 104%, which aligns with the interpretation of the Odds Ratio as an indicator of relative change.
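Ⓡ The arithmetic can be checked directly. Note that the \(\approx 104\%\) figure comes from rounding \(p_1\) to 0.33; with the exact value \(p_1=1/3\), the increase is exactly 100%, matching OR \(=2\):

```r
# Worked example with exact p1 = 1/3 and p2 = 0.5
odds1 <- (1/3) / (1 - 1/3)              # = 0.5
odds2 <- 0.5   / (1 - 0.5)              # = 1
OR    <- odds2 / odds1                  # = 2
pct   <- (odds2 - odds1) / odds1 * 100  # percentage increase in the odds
c(OR = OR, pct_increase = pct)
```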
Other interpretation: relative risk. The OR can be related to relative risk. For probabilities \(p (x)\) and \(p(x')\) that are very small compared to 1, the OR can be approximated as: \(OR (x,x') \approx p (x)/p (x'),\) which provides a relative interpretation between \(x\) and \(x'\).
Example. In the context of a rare disease (where \(Y=1\) indicates the presence of the disease), if \[OR(x,x')\approx p(x)/p(x')=4,\] then the response (illness) is four times more likely for individual \(x\) than for individual \(x'\).
Other interpretation: impact of an explanatory variable. In a logistic regression model, the log-odds of the probability can be expressed as: \[ \begin{array}{l} logit(p(x))=\beta_0+\beta_1x_1+\ldots+\beta_px_p,\\ logit(p(x'))=\beta_0+\beta_1x'_1+\ldots+\beta_px'_p. \end{array} \] Consequently, the Odds Ratio can be expressed as: \[ OR(x,x')=\exp\left( \beta_1(x_1-x'_1)+\ldots+\beta_p(x_p-x'_p) \right). \] To examine the influence of an explanatory variable \(x_j\) on the Odds Ratio, we consider two observations \(x\) and \(x'\) that differ only in the \(j\)-th variable (i.e., \(x_i=x'_i\) for all \(i\neq j\) and \(x_j\neq x'_j\)). The Odds Ratio simplifies to: \[ OR(x,x')=\exp\left( \beta_j(x_j-x'_j) \right). \] If \(x_j-x'_j=1\), then \(OR(x,x')=\exp( \beta_j),\) which allows us to study the influence of variable \(j\) on the response without dependence on the values of \(x\). These values can be obtained directly from Ⓡ .
Example. Consider two predictors \(X^{(1)}\) and \(X^{(2)}\) such that \[\begin{array}{ll} \beta_1=0.5&\Rightarrow\text{e}^{\beta_1}=\text{e}^{0.5}\approx 1.65\\ \beta_2=-0.15&\Rightarrow\text{e}^{\beta_2}=\text{e}^{-0.15}\approx 0.86 \end{array} \]
These exponentiated coefficients can be interpreted as the multiplicative (or percentage) change in the odds associated with a one-unit increase in the predictor (while holding the other predictors constant): \[\begin{array}{lll} \text{Predictor 1:}&\text{from 1 to 1.65 }&\Rightarrow\text{ a 65\% increase}\\ \text{Predictor 2:}&\text{from 1 to 0.86 }&\Rightarrow\text{ a 14\% decrease} \end{array} \]
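Ⓡ As a generic illustration (on the built-in mtcars data, not the admissions data), exponentiating the coefficients of a fitted glm yields these odds ratios directly:

```r
# Odds ratios from a logistic model: exp() of the coefficients
fit <- glm(vs ~ mpg + wt, family = binomial, data = mtcars)
exp(coef(fit))             # multiplicative change in odds per one-unit increase
exp(confint.default(fit))  # Wald confidence intervals on the OR scale
```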
Interpretation in Ⓡ . For a given explanatory variable
In this section, we will calculate the Odds Ratios (OR) for our logistic regression model, including their confidence intervals and associated p-values. The null hypothesis (\(\mathcal{H}_0\)) posits that there is no effect, while the alternative hypothesis (\(\mathcal{H}_1\)) suggests the presence of an effect.
Ⓡ
Displaying Odds Ratios and Confidence Intervals. To
obtain the Odds Ratios along with their confidence intervals and
p-values, we can use the odds.ratio function from
the questionr package. This will provide a clear summary of
how each predictor variable affects the outcome.
|   | OR | 2.5 % | 97.5 % | p |
|---|---|---|---|---|
| (Intercept) | 0.0213227 | 0.0012976 | 0.3150431 | 0.0058662 |
| rank2 | 0.4968580 | 0.2308283 | 1.0529672 | 0.0695536 |
| rank3 | 0.1812387 | 0.0768620 | 0.4125094 | 0.0000630 |
| rank4 | 0.1636270 | 0.0544679 | 0.4460768 | 0.0006618 |
| gpa | 2.4195787 | 1.0950123 | 5.4800698 | 0.0309484 |
| gre | 1.0017814 | 0.9991697 | 1.0044581 | 0.1847689 |
📈 Interpretation
Intercept. An OR of 0.0213227 indicates that when all predictors are at their reference levels, the odds of the outcome occurring are very low.
rank2. An OR of 0.496858 suggests that moving from the reference rank to rank 2 reduces the odds of the outcome by approximately 50.31% compared to the reference rank, but this result is only marginally significant (p = 0.0695536).
rank3. An OR of 0.1812387 indicates that moving to rank 3 decreases the odds of the outcome significantly (by about 81.88%) compared to the reference rank. This is highly significant (p \(\approx 6.30\times 10^{-5}\)).
rank4. Similarly, an OR of 0.163627 indicates that moving to rank 4 decreases the odds of the outcome by approximately 83.64%. This result is also highly significant (p \(\approx 6.62\times 10^{-4}\)).
gpa. An OR of 2.4195787 suggests that for each unit increase in gpa, the odds of the outcome occurring increase by about 142% (the odds are multiplied by roughly 2.42). This effect is statistically significant (p = 0.0309484).
gre. An OR of 1.0017814 indicates a negligible increase in odds for each unit increase in gre score, and the p-value (0.1847689) suggests that this effect is not statistically significant.
The 2.5% and 97.5% confidence intervals provide a range of plausible values for the OR. For instance, the OR for rank 3 ranges from 0.076862 to 0.4125094, which does not include 1, indicating a significant effect. Conversely, for gre, the confidence interval includes 1 (0.9991697 to 1.0044581), suggesting that it may not have a significant impact.
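Ⓡ A base-R sketch of the same table, illustrated on mtcars rather than the admissions data (Wald intervals are used here, whereas questionr::odds.ratio may report profile-likelihood intervals, so values can differ slightly):

```r
# ORs, 2.5%/97.5% Wald intervals, and p-values, assembled by hand
fit <- glm(vs ~ mpg + wt, family = binomial, data = mtcars)
or_table <- cbind(
  OR = exp(coef(fit)),
  exp(confint.default(fit)),                    # 2.5 % and 97.5 % bounds
  p  = summary(fit)$coefficients[, "Pr(>|z|)"]
)
round(or_table, 4)
```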
Ⓡ Visualizing Odds Ratios with a Forest Plot A forest plot is an effective way to visually represent the Odds Ratios, their confidence intervals, and significance levels. We can use the forest_model function from the forestmodel package to generate this plot.
We start with a Type I test, which evaluates the significance of each predictor in the model sequentially. The anova() function output shows the deviance reductions for each predictor added sequentially.
Ⓡ Here’s the summary:
📈 These results suggest that both
rank and gpa add value to the model, whereas
gre may not be a necessary predictor.
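Ⓡ On generic data, the sequential (Type I) test takes the following form (mtcars is used here purely as an illustration):

```r
# Type I (sequential) deviance tests: each term is tested in the order it enters
fit <- glm(vs ~ mpg + wt, family = binomial, data = mtcars)
anova(fit, test = "Chisq")
```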
Next, we perform Type II tests using the
drop1 function. This function removes each variable one at
a time and evaluates the change in deviance to determine if the model
fit significantly worsens without that variable. This approach is useful
to check if each variable contributes significantly to the model
(note: this method is not compatible with
multinom models).
⚠️ This method is not compatible with multinom models.
Ⓡ Here’s the drop1() output:
📈 Both Type I and Type II tests
highlight rank and gpa as significant
predictors, while gre appears unnecessary.
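Ⓡ A minimal drop1() sketch on illustrative data:

```r
# Type II check: remove each term from the full model and test the deviance change
fit <- glm(vs ~ mpg + wt, family = binomial, data = mtcars)
drop1(fit, test = "Chisq")
```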
Based on the results above, we’ll explore different model
configurations to identify the best-fitting model for predicting
admit.
Ⓡ Here are the models:
model_glm<-glm(admit~rank+gpa+gre,family='binomial',data=train)
modF2<-glm(admit~rank+gre,family='binomial',data=train)
modF<-glm(admit~rank+gpa,family='binomial',data=train)
modNull<-glm(admit~1,family='binomial',data=train)
To compare these models, we can use the LRstats()
function from the vcdExtra package, which provides an
asymptotic Goodness-of-Fit test by deviance to check if a model \([mod]\) of size \(m\) is adequate to explain the data.
\[ \begin{cases} \mathcal{H}_0 :& \mathcal{D}_{[mod]}=0&\text{model is adequate}\\ \mathcal{H}_1 :& \mathcal{D}_{[mod]}>0&\text{model is inadequate}\\ \end{cases} \]
📝 Remarks.
- This test assesses if the model differs significantly from a saturated model.
- Under \(\mathcal{H}_0\), the test implies that there is no significant difference between the models.
[Proposition 6]
Under certain conditions and under \(\mathcal{H}_0\), \[\mathcal{D}_{[mod]}\overset{\mathcal{D}}{\longrightarrow}
\chi^2_{n-m},
\] where \(n−m\) represents the
degrees of freedom. For a given significance level \(\alpha\in(0,1)\), the rejection region is:
\[\left\{\mathcal{D}_{[mod]}\geq
q_{1-\alpha}^{ \chi^2_{n-m}}\right\},\] where \(q_{1-\alpha}^{ \chi^2_{n-m}}\) is the
quantile of order \(1-\alpha\) from the
chi-squared distribution with \(n-m\)
degrees of freedom.
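Ⓡ The deviance test of Proposition 6 can be computed by hand (illustrated on mtcars; note that this asymptotic test can be unreliable for ungrouped binary data):

```r
# Goodness-of-fit by deviance: compare D_[mod] to a chi^2 with n - m df
fit  <- glm(vs ~ mpg + wt, family = binomial, data = mtcars)
D    <- deviance(fit)      # residual deviance D_[mod]
df   <- df.residual(fit)   # n - m degrees of freedom
pval <- pchisq(D, df, lower.tail = FALSE)
c(deviance = D, df = df, p.value = pval)
```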
Ⓡ Here’s the LRstats() output:
📈 Interpretation
- AIC and BIC values are lowest for modF and model_glm, suggesting they are the best-fitting models among those considered.
- The deviance goodness-of-fit test for the full model (model_glm) is nearly significant (\(\approx\) 0.05), suggesting it fits reasonably well but may include unnecessary complexity.
- modF is the preferred model, as it balances model fit with simplicity, making it both interpretable and efficient for prediction.
To delve deeper, we compare nested models using a Likelihood Ratio (LR) test. This test evaluates whether adding parameters (moving from a simpler model to a more complex model) significantly improves the fit.
[Proposition 7]
Consider two nested models \([m_0]\) and \([m_1]\), where (\([m_0]\subset [m_1]\)). The hypotheses are: \[ \begin{cases} \mathcal{H}_0 :& [m_0] &\text{ is adequate}\\ \mathcal{H}_1 :& [m_1] &\text{ is adequate}\\ \end{cases} \]
Under \(\mathcal{H}_0\), the LR test statistic is defined as: \[\text{LRtest}:=2(\mathcal{L}_{[m_1]}-\mathcal{L}_{[m_0]})=(\mathcal{D}_{[m_0]}-\mathcal{D}_{[m_1]})\overset{\mathcal{D}}{\longrightarrow} \chi^2_{m_1-m_0},\] where \(m_1−m_0\) represents the degrees of freedom. For a given significance level \(\alpha\in(0,1)\), the rejection region is: \[\left\{\text{LRtest}\geq q_{1-\alpha}^{ \chi^2_{m_1-m_0}}\right\},\] where \(q_{1-\alpha}^{ \chi^2_{m_1-m_0}}\) is the quantile of order \(1-\alpha\) from the chi-squared distribution with \(m_1-m_0\) degrees of freedom.
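Ⓡ The LR statistic can be computed by hand from the two deviances, or obtained via anova() (mtcars models are shown here as an assumed illustration):

```r
# LR test of nested logistic models: [m0] nested in [m1]
m0 <- glm(vs ~ mpg,      family = binomial, data = mtcars)
m1 <- glm(vs ~ mpg + wt, family = binomial, data = mtcars)
LR   <- deviance(m0) - deviance(m1)              # D_[m0] - D_[m1]
pval <- pchisq(LR, df = 1, lower.tail = FALSE)   # df = m1 - m0 = 1
anova(m0, m1, test = "Chisq")                    # same test via anova()
c(LR = LR, p.value = pval)
```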
Ⓡ To perform this test, we can use the lrtest function from the lmtest package.
📈 The result confirms that
modF is preferable over model_glm.
Testing for Interactions. Lastly, we consider
whether interaction terms improve the model fit. We introduce an
interaction between gpa and rank:
📈 The test indicates that adding the
interaction term does not improve the model, so we maintain
modF as the preferred model.
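Ⓡ The interaction test has the same nested-model form; a sketch on illustrative data:

```r
# Does adding an interaction term improve the fit? LR test of nested models
m_main <- glm(vs ~ mpg + wt, family = binomial, data = mtcars)
m_int  <- glm(vs ~ mpg * wt, family = binomial, data = mtcars)  # adds mpg:wt
anova(m_main, m_int, test = "Chisq")
```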
Graphical representation of each predictor’s effect on the target
variable can provide valuable insights into model behavior, especially
in understanding how individual variables influence predictions. The
allEffects function from the effects package
allows us to plot the estimated effects of each predictor in the
logistic model, helping us visualize trends and relationships.
⚠️ The effects package works for glm models but is not compatible with multinom models.
📈 Interpretation
gpa on Admission. The plot shows the relationship between gpa and the likelihood of admission. As gpa increases, the likelihood of admission also increases: students with a higher gpa have a greater probability of being admitted. This aligns with the model’s intuition that academic performance (reflected in gpa) positively influences admission outcomes.
rank on Admission. The plot for rank illustrates the probability of admission across different ranks of the applicant’s undergraduate institution. As rank improves (a lower rank number, indicating a more prestigious institution), the likelihood of admission increases significantly: applicants from higher-ranked schools have a better chance of admission. The model attributes significant weight to the rank variable, suggesting it is a crucial factor in the admission decision.
Overall, gpa and rank each contribute to the predicted probability of admission. Such comparisons are essential in assessing variable importance and informing strategies for improving admission odds. These plots provide actionable insights: for example, applicants might prioritize institutions with a higher rank or aim to improve their gpa to enhance their chances.